Vector Embedding Introduction¶

  • Thomas Fuchs, Data Scientist
  • Georg M. Sorst, Team Lead Search

Keyword Search¶

Keyword search compares query words with document words.

🔎 notebook

📄 This notebook is perfect for your business needs ✅

But it has no concept of similarity.

🔎 notebook

📄 Enjoy the latest games with this high-performance laptop ❌
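
The behaviour above can be sketched with a naive matcher. This is only an illustration, assuming that keyword search lowercases the text and compares whole words; the helper names `tokenize` and `keyword_match` are hypothetical and not part of this notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_match(query, document):
    """Return True if every query word occurs literally in the document."""
    return tokenize(query) <= tokenize(document)

docs = [
    "This notebook is perfect for your business needs",
    "Enjoy the latest games with this high-performance laptop",
]
matches = [keyword_match("notebook", doc) for doc in docs]
print(matches)  # [True, False]
```

The second document is about the same kind of product, but it is missed because the literal word "notebook" does not occur in it.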

Vector Search¶

Vector search looks for words with a similar meaning.

🔎 notebook

📄 This notebook is perfect for your business needs ✅

📄 Enjoy the latest games with this high-performance laptop ✅

Vector Embeddings¶

Specialised neural networks can transform text into vectors.

These vectors can be embedded into a common vector space.

This makes it possible to discover semantic relationships between texts.

Let's define some words.

In [3]:
words = [
    "queen",
    "king",
    "prince",
    "princess",
    "man",
    "woman",
    "boy",
    "girl",
    "red",
    "green",
    "blue",
    "palace",
]

Transforming words into vectors is easy with Python.

Many free models exist to perform vector embedding.

In [4]:
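from sentence_transformers import SentenceTransformer

def embed(texts):
    """Encode a string or a list of strings into embedding vectors."""
    model_name = "all-MiniLM-L6-v2"
    model = SentenceTransformer(
        model_name,
        device=helpers.get_torch_device_name(),  # Optional: if you want to run this on GPU
    )
    return model.encode(texts)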

The resulting vector is returned as a NumPy array.

All vectors share the same dimensionality; for this model, that is 384 dimensions.

In [5]:
pd.DataFrame(embed(words[0]))  # words[0] == "queen"
Out[5]:
0
0 0.035487
1 -0.065605
2 -0.009935
3 0.031590
4 -0.013387
... ...
379 0.026038
380 0.091385
381 -0.053889
382 -0.031242
383 -0.086961

384 rows × 1 columns

Each word is transformed into a vector so that we can discover semantic relationships.

In [6]:
pd.DataFrame(
    {"Sentence": words, "Encoding": list(embed(words))}
).head(3)
Out[6]:
Sentence Encoding
0 queen [0.03548697, -0.06560468, -0.009934984, 0.0315...
1 king [-0.059599347, 0.050512385, -0.06951009, 0.079...
2 prince [-0.036828794, 0.04128195, 0.04185658, 0.04177...

Let's visualize the vectors to show their relationships.

But 384-dimensional vectors cannot be plotted on a 2-dimensional screen.

Principal Component Analysis (PCA) can reduce the number of dimensions from 384 to 2.

In [7]:
from sklearn.decomposition import PCA

word_samples = words[0:3]
embeddings = embed(word_samples)
reduced_embeddings = PCA(n_components=2).fit_transform(embeddings)
pd.DataFrame(
    {"Words": word_samples, "Encoding": list(reduced_embeddings)}
)
Out[7]:
Words Encoding
0 queen [-0.2934047, -0.38926467]
1 king [-0.2532008, 0.4088319]
2 prince [0.54660547, -0.019567199]

When visualizing the reduced vectors, clear semantic clusters appear.

In [9]:
plot(words, embed(words))

Sentence Embeddings¶

We can embed not only words but also entire sentences and documents.

Let's define some documents and embed them.

In [10]:
import pandas as pd

documents = [
    "Vector embeddings are mathematical representations of objects, often words or phrases, in a high-dimensional space. By mapping similar objects to proximate points, embeddings capture relationships and semantic meaning. Commonly used in machine learning and natural language processing tasks, methods like Word2Vec, GloVe, and FastText have popularized their application, enabling advancements in text analysis, recommendation systems, and more.",
    "Keyword search refers to the process of locating information in a database, search engine, or other data repository by specifying particular words, phrases, or symbols. In the digital realm, it's foundational to search engines like Google and Bing. The search results are typically ranked based on relevance, which is determined using various algorithms that consider factors like frequency, location, and link structures. Keyword search is integral for navigating the vast expanse of online information, aiding users in retrieving relevant data efficiently.",
    "Sandwiches are a popular type of food consisting of one or more types of food, such as vegetables, sliced meat, or cheese, placed between slices of bread. They can range from simple combinations like peanut butter and jelly to more complex gourmet creations. Originating from England in the 18th century, sandwiches have become a staple in many cultures worldwide, prized for their convenience and versatility. Variations exist based on regional preferences, ingredients, and preparation methods.",
    "Data science is an interdisciplinary field that leverages statistical, computational, and domain-specific expertise to extract insights and knowledge from structured and unstructured data. It encompasses various techniques from statistics, machine learning, data mining, and big data technologies to analyze and interpret complex data. Data science has applications across numerous sectors, including healthcare, finance, marketing, and social sciences, driving decision-making, predictive analytics, and artificial intelligence advancements. Its growing significance in today's data-driven world has led to the rise of specialized tools, methodologies, and educational programs.",
    "Neural networks are a class of machine learning models inspired by the biological neural networks of animal brains. They consist of interconnected layers of nodes, or neurons, which process input data through a series of transformations and connections to produce output. Neural networks are particularly adept at recognizing patterns, making them useful for a wide range of applications such as image and speech recognition, natural language processing, and predictive analytics. The development of deep neural networks, which contain multiple hidden layers, has been central to the field of deep learning and has significantly advanced the capabilities of artificial intelligence systems.",
    "Pasta is a staple food of traditional Italian cuisine, with the first reference dating to 1154 in Sicily. It is typically made from an unleavened dough of durum wheat flour mixed with water or eggs and formed into sheets or various shapes, then cooked by boiling or baking. Pasta is versatile and can be served with a variety of sauces, meats, and vegetables. It is categorized in two basic styles: dried and fresh. Popular around the world, pasta dishes are central to many diets and come in numerous shapes like spaghetti, penne, and ravioli.",
    "Soup is a liquid food, generally served warm or hot (but also cold), that is made by combining ingredients such as meat and vegetables with stock, juice, water, or another liquid. Soups are inherently diverse, ranging from rich, cream-based varieties to brothy and vegetable-laden concoctions. They are often regarded as comfort food and can be served as a main dish or as an appetizer, with regional and cultural variations like the Spanish gazpacho, Japanese miso soup, and Russian borscht.",
    "A casserole is a comprehensive one-dish meal baked in a deep, ovenproof dish with a glass or ceramic base. It typically includes a combination of meats, vegetables, starches like rice or potatoes, and a binding agent like a soup or sauce. Topped with cheese or breadcrumbs for a crispy crust, casseroles are appreciated for their convenience and the ability to meld flavors during the baking process. They are a fixture in many cultures and are particularly beloved as home-cooked comfort foods, often featuring in communal gatherings and family dinners.",
]

pd.DataFrame(
    {"Sentence": documents, "Encoding": list(embed(documents))}
).head(3)
Out[10]:
Sentence Encoding
0 Vector embeddings are mathematical representat... [-0.0016682206, -0.069409974, -0.026505154, 0....
1 Keyword search refers to the process of locati... [0.019650526, -0.06271499, -0.045780797, -0.00...
2 Sandwiches are a popular type of food consisti... [-0.04432275, -0.023782436, 0.036511302, -0.01...

Again, semantic clusters appear when visualizing the vectors in 2D space.

In [12]:
plots([(documents, embed(documents), "green")])

Information Retrieval¶

Similar documents have similar vectors.

This characteristic can be used to retrieve related documents for an input text.

Let's start by defining some search queries.

In [13]:
queries = [
    "information retrieval",
    "machine learning",
    "cooking",
]
plots([(queries, embed(queries), "red")])

Visualizing documents and queries in one space uncovers semantic relations.

Each query is closest to its most relevant documents.

In [14]:
plots(
    [
        (documents, embed(documents), "green"),
        (queries, embed(queries), "red"),
    ]
)
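
Retrieval then amounts to ranking the documents by their similarity to the query vector. The following is a minimal sketch of that idea, not the code behind the plots: the 2-D vectors are hand-made stand-ins for real 384-dimensional embeddings, and the helper names `normalize` and `retrieve` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query_vec, doc_vecs, k=2):
    """Return the labels of the k documents most similar to the query."""
    q = normalize(query_vec)
    # For unit vectors, the dot product equals the cosine of the angle.
    scores = {
        label: sum(a * b for a, b in zip(q, normalize(vec)))
        for label, vec in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hand-made 2-D stand-ins for real embeddings (illustrative values only).
doc_vecs = {
    "pasta": [0.9, 0.1],
    "soup": [0.8, 0.3],
    "neural networks": [0.1, 0.9],
    "data science": [0.2, 0.8],
}
top_cooking = retrieve([1.0, 0.2], doc_vecs)  # stand-in for the "cooking" query
top_ml = retrieve([0.1, 1.0], doc_vecs)       # stand-in for "machine learning"
print(top_cooking, top_ml)
```

With real embeddings, `doc_vecs` would hold the output of `embed(documents)` and the query vector would come from `embed(queries)`.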

Simple Search¶

Load Data¶

First, we need to load data. Here we use product data from a customer.

In [16]:
df_products.head(3)
Out[16]:
productId name description
0 4544264339511 Chelsie Medium Wash Jeans FINAL SALE Details\nMedium wash denim jeans\nFabric has s...
1 4544264732727 Nothing But The Best White Bodysuit Details White long sleeve bodysuit Fabric has ...
2 4544267976759 Chelsie Dark Wash Jeans FINAL SALE Details\nDark wash denim skinny jeans\nFabric ...

Embed Data¶

The next step is to embed the data.
In a real-world scenario, a dedicated search engine such as Elasticsearch would apply the embeddings and assign weights to the different fields.
One advantage of embeddings is that several fields can be combined into a single text before embedding, so that one vector represents the whole product.

In [17]:
def combine_fields(x):
    # Merge the name and description of one product into a single string
    return (
        f"Product name = {x['name']}\n"
        f"Description = {x['description']}\n"
    )
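
To make the combined text concrete, here is a small sketch that applies the same formatting to a single record. The product dictionary is a shortened, hypothetical example in the shape of `df_products`; how the notebook actually builds `df_for_search` is not shown here.

```python
def combine_fields(product):
    """Join the searchable fields of one product into a single string."""
    return (
        f"Product name = {product['name']}\n"
        f"Description = {product['description']}\n"
    )

# Shortened, hypothetical record in the shape of df_products.
product = {
    "name": "Chelsie Medium Wash Jeans",
    "description": "Medium wash denim jeans",
}
base_string = combine_fields(product)
print(base_string)
```

With pandas, the same function could presumably be applied row by row via `df_products.apply(combine_fields, axis=1)`, and the resulting strings passed to the embedding model.
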
In [19]:
df_for_search.head(5).style
Out[19]:
  base_string embeddings
0 Product name = Chelsie Medium Wash Jeans FINAL SALE Description = Details Medium wash denim jeans Fabric has some stretch High waist style with frayed cuffs and button up style Pair these jeans with a cute blouse or bodysuit Unlined Size 25 inseam: 27" Material and Care 65% Cotton, 30% Polyester, 3% Rayon, 2% Spandex Machine wash cold inside out / Tumble dry low Patterns may vary Materials may have natural variations Colors may vary from different viewing devices. [-0.0009643 -0.00200403 -0.01168522 ... -0.02048676 0.00592895 -0.02616584]
1 Product name = Nothing But The Best White Bodysuit Description = Details White long sleeve bodysuit Fabric has stretch, lightweight, fitted bodycon style Balloon sleeves with a v-neckline, ruched shoulders, and snap button closure Pair this cute bodysuit with a skirt and booties Unlined Size small from shoulder to hem: 28" Material and Care 92% Rayon, 8% Spandex Hand wash cold / Dry Flat Patterns may vary Materials may have natural variations Colors may vary from different viewing devices. [-0.00222488 -0.0072418 -0.01070451 ... -0.026238 0.02467378 -0.03861708]
2 Product name = Chelsie Dark Wash Jeans FINAL SALE Description = Details Dark wash denim skinny jeans Fabric has some stretch High waisted style with frayed cuffs and button detail closure Pair these cute jeans with a sweater or bodysuit Unlined Size 3 from waist to hem: 37" Material and Care 65% Cotton, 30% Polyester, 3% Rayon, 2% Spandex Machine wash cold inside out / Tumble dry low Patterns may vary Materials may have natural variations Colors may vary from different viewing devices. [-0.00644412 -0.00571303 0.00532717 ... -0.01751096 0.00276815 -0.02006898]
3 Product name = Kristy Dark Wash Distressed Skinny Jeans FINAL SALE Description = Details Dark wash distressed skinny jeans Fabric has some stretch Distressed hem with a skinny jean style Pair these jeans with a sweater and booties Unlined Size 3 from waist to hem: 35" Material and Care 71.9% Cotton, 23.8% Polyester, 2.8% Rayon, 1.5% Spandex Machine wash cold inside out / Tumble dry low Patterns may vary Materials may have natural variations Colors may vary from different viewing devices. [-0.0116023 -0.02009334 -0.02260989 ... -0.01975745 -0.00099181 -0.00597608]
4 Product name = Chelsie White Jeans FINAL SALE Description = Details White denim jeans Fabric has some stretch High waist style with frayed cuffs and button up style Pair these jeans with a cute blouse or bodysuit Unlined Size 3 from waist to hem: 37" Material and Care 87.4% Cotton, 9.6% Polyester, 3% Spandex Machine wash cold inside out / Tumble dry low Patterns may vary Materials may have natural variations Colors may vary from different viewing devices. [-0.00319793 0.00712412 0.00121423 ... -0.00876944 0.00661256 -0.02469784]

Calculate Similarity¶

The most common way to measure the similarity of two embeddings is cosine similarity or its complement, cosine distance.
$\text{cosine similarity}=S_{C}(A,B):=\cos(\theta)=\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|}=\frac{\sum\limits_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum\limits_{i=1}^{n}A_{i}^{2}}\,\sqrt{\sum\limits_{i=1}^{n}B_{i}^{2}}}\in[-1,1]$
$\text{cosine distance}=D_{C}(A,B):=1-S_{C}(A,B)$
Note: cosine distance is not a true distance metric, because it does not satisfy the triangle inequality.
We can use the existing cosine_similarity function from sklearn for this purpose.
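
The two formulas can be written out in a few lines of plain Python. This is a dependency-free sketch for illustration, not the sklearn implementation; the function names and the example vectors are chosen here and do not come from the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One minus the cosine similarity, so values lie between 0 and 2."""
    return 1.0 - cosine_similarity(a, b)

a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

same = cosine_similarity(a, a)                # same direction
orthogonal = cosine_similarity(a, c)          # perpendicular vectors
opposite = cosine_similarity(a, [-1.0, 0.0])  # opposite direction

# A true metric would satisfy d(a, c) <= d(a, b) + d(b, c).
direct = cosine_distance(a, c)
detour = cosine_distance(a, b) + cosine_distance(b, c)
print(direct > detour)  # True: the triangle inequality is violated
```

Here the direct distance from `a` to `c` is 1, while the detour via `b` adds up to roughly 0.59, which is why cosine distance is not a proper metric.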

Approximate calculation of similarity¶

With Annoy (Approximate Nearest Neighbors Oh Yeah), we can speed up the search considerably by accepting approximate instead of exact nearest neighbours.
To do this, we build an index that is both fast to query and compact.
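
The idea behind such an index can be sketched in plain Python. The code below is a simplified illustration of a forest of random-projection trees, the general technique Annoy is based on. It is not Annoy's implementation or API, it leaves out many details of the real library, and all names in it are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, vectors, rng, leaf_size=2):
    """Recursively split the item ids with random hyperplanes."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # Use the hyperplane equidistant from two randomly sampled items.
    p, q = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(p, q)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(p, q)])
    left = [i for i in ids if dot(normal, vectors[i]) <= offset]
    right = [i for i in ids if dot(normal, vectors[i]) > offset]
    if not left or not right:  # identical points cannot be separated
        return {"leaf": ids}
    return {
        "normal": normal,
        "offset": offset,
        "left": build_tree(left, vectors, rng, leaf_size),
        "right": build_tree(right, vectors, rng, leaf_size),
    }

def candidates(tree, query):
    """Follow the query down one tree and return the ids in its leaf."""
    while "leaf" not in tree:
        go_left = dot(tree["normal"], query) <= tree["offset"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

def search(forest, vectors, query, k=1):
    """Collect leaf candidates from all trees, then rank only those exactly."""
    pool = set()
    for tree in forest:
        pool.update(candidates(tree, query))
    def sq_dist(i):
        return sum((x - y) ** 2 for x, y in zip(vectors[i], query))
    return sorted(pool, key=sq_dist)[:k]

# Toy 2-D data (illustrative values only).
vectors = [[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [1.0, 1.0],
           [0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.8, 0.1]]
rng = random.Random(0)
forest = [build_tree(list(range(len(vectors))), vectors, rng) for _ in range(5)]
nearest = search(forest, vectors, vectors[3], k=1)
print(nearest)  # [3]
```

Each query compares against only the few items that share a leaf with it instead of the whole dataset. More trees enlarge the candidate pool, which tends to improve accuracy at the cost of a larger index.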

In [24]:
ann_index: AnnoyIndex = get_annoy_index(df_for_search, n_trees=20)
In [27]:
fig_annoy.show()

Time Comparison¶

Query with Annoy¶

In [29]:
query_easy = "hoodie"
model.encode(query_easy)
Out[29]:
array([-0.01272601, -0.00032983, -0.01062606, ..., -0.01671979,
       -0.01496496, -0.01919532], dtype=float32)
In [31]:
display(HTML(html_easy))

ANNOY VectorSearch for:
'hoodie'

Better Than Ever Green Washed Hoodie FINAL SALE
Have It My Way Grey Textured Knit Zip Up Hooded Sweatshirt
Have It My Way Pink Textured Knit Zip Up Hooded Sweatshirt FINAL SALE
Cozy Charm Grey Textured Sherpa Zip Up Jacket FINAL SALE
Chill Factor Purple Quarter Zip Pullover Sweatshirt FINAL SALE

Complex Query with Annoy¶

In [32]:
query_complex = "For my Frau, I need a schwarze Jumpsuit"
In [34]:
display(HTML(html_complex))

ANNOY VectorSearch for:
'For my Frau, I need a schwarze Jumpsuit'

Perfectly Poised White Jumpsuit FINAL SALE
Jump For Joy Black Gauze Jumpsuit FINAL SALE
Jump For Joy Lavender Gauze Jumpsuit
Destined To Impress Mocha Strapless Ruffle Jumpsuit
On Your Own Time Blue Velvet Short Sleeve Jumpsuit FINAL SALE

Same Query, Index with More Trees: Few Results¶

In [38]:
display(HTML(html_complex_more_tree))

ANNOY VectorSearch for:
'For my Frau, I need a schwarze Jumpsuit'

Perfectly Poised White Jumpsuit FINAL SALE
Jump For Joy Black Gauze Jumpsuit FINAL SALE
Joy To The World Black Velvet Long Sleeve Belted Jumpsuit FINAL SALE
So Over Love Songs V-Neck Fuchsia Jumpsuit FINAL SALE
Can't Stop Now Black Square Neck Jumpsuit FINAL SALE

Vector Search: Advantages and Disadvantages¶

Disadvantages¶

  1. Complexity:
    • Implementation and optimization of vector search algorithms can be complex, requiring specialized knowledge.
  2. Resource Intensive:
    • Computationally intensive, demanding significant computing resources for large-scale applications.
  3. Quality of Embeddings:
    • The effectiveness of vector search heavily depends on the quality of the embeddings, which may require fine-tuning.
  4. Interpretability:
    • Results may lack interpretability, making it challenging to understand the reasoning behind specific search outcomes.

Advantages¶

  1. Efficiency:
    • Vector search allows for fast and efficient similarity searches in high-dimensional spaces.
  2. Scalability:
    • Well-suited for large datasets and can scale effectively with the growing volume of data.
  3. Flexibility:
    • Adaptable to various data types, making it versatile for different domains such as image, text, and audio.
  4. Semantic Understanding:
    • Captures semantic relationships, enabling more meaningful and context-aware search results.